Recall that the testing set cannot be used during the training process; it is meant to simulate how well a model will perform on unseen data. That means we need a way to evaluate model performance before making predictions on the testing set. This chapter focuses on how we accomplish this task using resampling.
In the previous two chapters, we worked with the predictions from training a linear regression model. Now we’ll build another regression model using random forest.
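For context, the workflow being fit below might be specified roughly as follows. This is a sketch reconstructed from the printed output further down; the object names `rf_model` and `rf_workflow` and the exact specification are assumptions:

```r
library(tidymodels)

# A ranger random forest with 1000 trees (matching the printed fit below)
rf_model <-
  rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Workflow with a formula preprocessor (matching the printed fit below)
rf_workflow <-
  workflow() %>%
  add_formula(
    sale_price ~ neighborhood + gr_liv_area + year_built + bldg_type +
      latitude + longitude
  ) %>%
  add_model(rf_model)
```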
# Fit
(rf_fit <- fit(rf_workflow, ames_train))
══ Workflow [trained] ══════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────
sale_price ~ neighborhood + gr_liv_area + year_built + bldg_type +
latitude + longitude
── Model ───────────────────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 1000
Sample size: 2342
Number of independent variables: 6
Mtry: 2
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.005315515
R squared (OOB): 0.82922
Now we’ll predict the training data (note: this is for demonstration only, so we can compare the two models).
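A minimal sketch of those resubstitution metrics (this assumes the tidymodels packages are loaded; `metrics()` is yardstick's default metric set of RMSE, R-squared, and MAE):

```r
# Predict the training set and score the predictions against the truth.
# Resubstitution estimates like these tend to be optimistic.
rf_fit %>%
  predict(ames_train) %>%
  bind_cols(ames_train %>% select(sale_price)) %>%
  metrics(truth = sale_price, estimate = .pred)
```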
Based on the resubstitution error rate, the random forest model is better at predicting the training data compared to the ordinary least squares regression. Now we’ll evaluate how well the random forest model performs on the test data.
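The same calculation on the test data might look like this (assuming the test split is named `ames_test`):

```r
# Score the random forest on held-out data
rf_fit %>%
  predict(ames_test) %>%
  bind_cols(ames_test %>% select(sale_price)) %>%
  metrics(truth = sale_price, estimate = .pred)
```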
Now we see that the RMSE is much higher and the coefficient of determination is lower; this is because the model did not generalize as well to new, unseen data. In this context, the random forest model has low bias but higher variance. For thoroughness, we can compute the same statistics for the ordinary least squares model on the test data.
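Assuming the fitted linear model from the previous chapters is named `lm_fit`, a sketch:

```r
# Score the ordinary least squares model on the same test set
lm_fit %>%
  predict(ames_test) %>%
  bind_cols(ames_test %>% select(sale_price)) %>%
  metrics(truth = sale_price, estimate = .pred)
```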
The RMSE and coefficient of determination are almost the same between the training and test sets; this is because, although less accurate on the training set (i.e., higher bias), ordinary least squares models tend to have lower variance than more flexible models like random forests.
Resampling methods further split the training data into analysis sets used to train and tune the model and assessment sets used to evaluate model performance. The latter are almost like pseudo testing sets, but ensure that no data leakage occurs from the testing set into the model training process.
One resampling method is cross-validation, called V-fold cross-validation in the textbook. The data set is randomly split once into V folds of roughly equal size. Then, over V iterations, each fold takes a turn as the holdout (assessment) set while the remaining V-1 folds are used for training, and the performance statistics are averaged over the V folds. 10-fold cross-validation is the most common choice, and it is what we will use here. In general, as V decreases, bias increases and variance decreases; conversely, as V increases, bias decreases at the expense of an increase in variance.
(ames_folds <- vfold_cv(ames_train, v = 10))
# 10-fold cross-validation
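With the folds in hand, the workflow can be evaluated across them; a sketch using `fit_resamples()` (the object name `rf_res` is an assumption):

```r
# Fit the workflow on each analysis set and score each assessment set
rf_res <- fit_resamples(rf_workflow, resamples = ames_folds)

# Average the performance statistics over the 10 folds
collect_metrics(rf_res)
```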
The Central Limit Theorem states that, with repeated sampling, the distribution of sample means converges toward a normal distribution. We can simulate sampling more data by performing repeated cross-validation: the V-fold cross-validation just described is repeated R times, each time with a different random split into folds. We can perform repeated V-fold cross-validation using the repeats argument in the vfold_cv() call.
vfold_cv(ames_train, v = 10, repeats = 5)
# 10-fold cross-validation repeated 5 times